Abstract: We consider a reconciliation problem, where two 
hosts wish to synchronize their respective sets. Efficient solutions 
for minimizing the communication cost between the two hosts 
have been previously proposed in the literature. However, they 
rely on prior knowledge about the size of the set differences 
between the two sets to be reconciled. In this paper, we propose 
a method which can achieve comparable efficiency without 
assuming this prior knowledge. Our method uses compressive 
sensing techniques which can leverage the expected sparsity in 
set differences. We study the performance of the method via 
theoretical analysis and numerical simulations. 

I. Introduction 

Set reconciliation occurs naturally. For example, routers 
may need to reconcile their routing tables and files on mobile 
devices may need to be synchronized with those in the cloud. 
The reconciliation problem is to find the set differences 
between two distributed sets. Here, the set difference for a 
host is defined as the set of elements that the host has but the 
other host does not. Once two hosts can find their respective 
set differences, each can use the information to solve the 
reconciliation problem by adding its difference set to the other 
or removing it from its own set to reconcile the two sets 
to their union or intersection, respectively. In this paper, for 
simplicity of presentation, we consider the simpler case in which a 
host reconciles its own set so that it equals the set that the other 
host currently possesses. 

We describe the problem we wish to solve in mathematical 
notation. Suppose that there are two hosts, A and B, which 
possess two sets, $S_A$ and $S_B$, respectively. The elements of 
$S_A$ and $S_B$ are from a set $U \subseteq \mathbb{N}$. The 
difference sets for A and B are $\Delta_A = S_A \setminus S_B$ and 
$\Delta_B = S_B \setminus S_A$, respectively. For example, if A has 
$S_A = \{1, 2, 3\}$ and B has $S_B = \{2, 3, 4\}$, then we have 
$\Delta_A = \{1\}$ and $\Delta_B = \{4\}$. We denote the size of a 
set $S$ by $|S|$. To ease the presentation, we assume throughout the 
paper that $|S_A|, |S_B| \le n$ and $d = |\Delta_A| + |\Delta_B| \le n$ 
for some positive integer $n$. The method proposed in this paper can 
be naturally extended to the case of $n < d \le 2n$ by simply 
increasing the space allocation from $2n$ to $4n$ (described in 
Sec. II-C). 

In the reconciliation problem, the two hosts wish to reconcile 
their sets by making them identical. For example, B can update 
$S_B$ by adding the elements in $\Delta_A$ to $S_B$ and removing the 
elements in $\Delta_B$ from $S_B$. This means, in the above example, 
that once B knows $\Delta_A = \{1\}$ and $\Delta_B = \{4\}$, B 
performs the operation $(S_B \cup \Delta_A) \setminus \Delta_B$. 
Consequently, the reconciliation is accomplished. 
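
As a minimal sketch of the notation above, the difference sets and B's update can be written with built-in set operations. Python is used only for illustration, and the variable names are ours rather than part of the proposed method.

```python
# Toy example from the text: A holds {1, 2, 3}, B holds {2, 3, 4}.
S_A = {1, 2, 3}
S_B = {2, 3, 4}

# Difference sets: elements that a host has but the other host lacks.
delta_A = S_A - S_B
delta_B = S_B - S_A

# Size d of the set differences.
d = len(delta_A) + len(delta_B)

# B reconciles its set to A's set: (S_B union delta_A) minus delta_B.
S_B_reconciled = (S_B | delta_A) - delta_B
```

Here B ends up with exactly A's set; the remainder of the paper is about learning the two difference sets without sending either set in full.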

In solving the reconciliation problem, we are mainly concerned 
with the communication cost, that is, the number of elements 
required to be transmitted between the two hosts. 

A. Related Work 

A straightforward method of solving the reconciliation 
problem is for host A to send its entire set $S_A$ to host B. After 
that, B can check and identify the set differences between $S_A$ 
and $S_B$. Obviously, the communication cost of this method 
is $|S_A|$. 

A more efficient but probabilistic method is to use a Bloom 
filter [1]. More specifically, host A constructs a Bloom filter 
by inserting the elements of $S_A$ and then sends the filter to B. 
With the received Bloom filter, B can check whether each element of 
$S_B$ is in the filter and thus identify $\Delta_B$, although some 
elements of $\Delta_B$ may be missed because of hash collisions in 
the Bloom filter. Similar queries made for the remaining elements of 
$U$ can be used to identify $\Delta_A$, although extra elements may 
be identified because of the same collisions. To keep the rate of 
false identifications low, the size of the Bloom filter needs to be 
proportional to $n$. Therefore, the communication cost of this Bloom 
filter approach is still asymptotically the same as that of the 
straightforward method. 
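
The Bloom filter step can be sketched as follows. This is an illustrative toy rather than the construction of [1]: the salted SHA-256 hash family, the filter size of 64 bits, and all function names are our assumptions.

```python
import hashlib


def positions(element, num_bits, num_hashes):
    """Hypothetical hash family: salted SHA-256 digests reduced modulo the filter size."""
    return [
        int.from_bytes(hashlib.sha256(f"{salt}:{element}".encode()).digest()[:8], "big") % num_bits
        for salt in range(num_hashes)
    ]


def build_filter(elements, num_bits, num_hashes):
    """Set the bits addressed by every element; this array is what A would send."""
    bits = [False] * num_bits
    for e in elements:
        for p in positions(e, num_bits, num_hashes):
            bits[p] = True
    return bits


def maybe_contains(bits, element, num_hashes):
    """True if all addressed bits are set (may be a false positive)."""
    return all(bits[p] for p in positions(element, len(bits), num_hashes))


S_A = {1, 9, 28, 33, 53, 61}
S_B = {1, 9, 10, 28, 53}
NUM_BITS, NUM_HASHES = 64, 3   # assumed parameters

filter_A = build_filter(S_A, NUM_BITS, NUM_HASHES)

# Elements of S_B rejected by the filter are certainly not in S_A.
found_delta_B = {e for e in S_B if not maybe_contains(filter_A, e, NUM_HASHES)}
```

Every element that B finds this way truly belongs to B's difference set, but a false positive can hide one, which is why the filter must grow with $n$ to stay reliable.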

Minsky et al. [5] developed a characteristic polynomial 
method. In this method, A sends several evaluated values of 
the characteristic polynomial $\chi_{S_A}$ to B, where $\chi_{S_A}$ 
is defined as $\chi_{S_A}(Z) = \prod_{i=1}^{|S_A|} (Z - x_i)$, with 
the $x_i$'s being the elements of $S_A$. Host B performs a similar 
evaluation based on its own characteristic polynomial $\chi_{S_B}$. 
By rational interpolation, B can recover the set differences based 
on the evaluated values of $\chi_{S_A}$ and $\chi_{S_B}$. Here, given 
$d_1 + d_2 + 1$ pairs $(k_i, f_i)$, rational interpolation finds an 
$f = P/Q$ satisfying $f(k_i) = f_i$ for each pair $(k_i, f_i)$, where 
the polynomials $P$ and $Q$ are of degrees $d_1$ and $d_2$, 
respectively. 

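Observe that 
$\frac{\chi_{S_A}}{\chi_{S_B}} = \frac{\chi_{S_A \cap S_B} \cdot \chi_{\Delta_A}}{\chi_{S_A \cap S_B} \cdot \chi_{\Delta_B}} = \frac{\chi_{\Delta_A}}{\chi_{\Delta_B}}$. 
A sends evaluated values of $\chi_{S_A}$ to B, and B calculates the 
value of $\chi_{S_A}/\chi_{S_B}$ at each predetermined evaluation 
point. Once $\chi_{\Delta_A}/\chi_{\Delta_B}$ is recovered from the 
evaluated values of $\chi_{S_A}/\chi_{S_B}$, the set differences can 
be obtained by finding the roots of $\chi_{\Delta_A}$ and 
$\chi_{\Delta_B}$. 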

A concrete example in [5] shows how this characteristic 
polynomial method works. Suppose that $S_A = \{1, 9, 28, 33, 53, 61\}$, 
$S_B = \{1, 9, 10, 28, 53\}$, the prior knowledge about $d$ is 
available, the evaluation points $\{0, 1, 2, 3\}$ have been 
predetermined, and a proper finite field $\mathbb{F}_{97}$ has 
been chosen. Under such conditions, $\chi_{S_A}$ and $\chi_{S_B}$ 
can be formulated as $(Z-1)(Z-9)(Z-28)(Z-33)(Z-53)(Z-61)$ 
and $(Z-1)(Z-9)(Z-10)(Z-28)(Z-53)$, respectively. 



JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2010 






The evaluations of $\chi_{S_A}$ and $\chi_{S_B}$ at the four 
evaluation points are $\{41, 85, 65, 81\}$* and $\{9, 14, 51, 46\}$ 
over $\mathbb{F}_{97}$, respectively. The values of 
$\chi_{S_A}/\chi_{S_B}$ are therefore 
$\{\frac{41}{9}, \frac{85}{14}, \frac{65}{51}, \frac{81}{46}\} = \{80, 13, 26, 84\}$. 
From the perspective of rational interpolation, the value 
$d_1 + d_2$ corresponds to the size $d$ of the set differences and 
$\{(k_i, f_i)\}$ corresponds to $\{(0, 80), (1, 13), (2, 26), (3, 84)\}$, 
of size $d_1 + d_2 + 1 = 4$. The interpolated 
$f = \frac{Z^2 + 3Z + 73}{Z + 87}$, where the roots of the numerator 
are 33 and 61 and the root of the denominator is 10, can be used to 
derive the set differences between $S_A$ and $S_B$. An issue in this 
reconciliation case is that only the size of the set differences, 
rather than the individual $d_1$ and $d_2$, is known, and so rational 
interpolation cannot be applied directly. Nevertheless, a formula is 
given in [5] for estimating $d_1$ and $d_2$ based only on the size of 
the set differences. Despite its algebraic computation over finite 
fields, a notable feature of this method is that the communication 
cost depends only on $d$, rather than on $n$, due to the use of 
interpolation. 
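
The numbers in this example can be checked with a few lines of modular arithmetic. The sketch below is ours and assumes Python 3.8 or later for modular inverses via `pow`. At the evaluation point 1, which is a root of both polynomials, it drops the shared factor $(Z - 1)$; this is our assumption about the special treatment mentioned in the footnote, and it happens to reproduce the values quoted above. The interpolation step itself is not implemented.

```python
P = 97                                   # field size used in the example
S_A = [1, 9, 28, 33, 53, 61]
S_B = [1, 9, 10, 28, 53]
POINTS = [0, 1, 2, 3]


def char_poly(elements, z, skip=None):
    """Evaluate prod (z - x) mod P over the elements, optionally skipping one element."""
    value = 1
    for x in elements:
        if x != skip:
            value = value * (z - x) % P
    return value


def ratio(z):
    """chi_{S_A}(z) / chi_{S_B}(z) mod P, dropping a factor shared by both (assumption)."""
    shared = z if (z in S_A and z in S_B) else None
    numerator = char_poly(S_A, z, skip=shared)
    denominator = char_poly(S_B, z, skip=shared)
    return numerator * pow(denominator, -1, P) % P


def f(z):
    """The interpolated rational function (Z^2 + 3Z + 73) / (Z + 87) over F_97."""
    return (z * z + 3 * z + 73) * pow(z + 87, -1, P) % P


evals_A = [char_poly(S_A, z, skip=z) for z in POINTS]
evals_B = [char_poly(S_B, z, skip=z) for z in POINTS]
ratios = [ratio(z) for z in POINTS]

# Roots of the numerator and of the denominator give the two difference sets.
numerator_roots = [z for z in range(P) if (z * z + 3 * z + 73) % P == 0]
denominator_root = (-87) % P
```

The roots 33 and 61 are the elements held only by A, and the root 10 is the element held only by B.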

Very recently, Goodrich and Mitzenmacher [4] developed a 
data structure, called the invertible Bloom lookup table (IBLT), to 
address the reconciliation problem. An IBLT can be thought of as 
a variant of the counting Bloom filter [3] with the property that 
the elements inserted into it can be extracted even under 
collision. With the use of an IBLT, the reconciliation problem can 
be solved with a communication cost of approximately $2d$, under 
the assumption that $d$ is known in advance. 



B. Research Gap and Contribution 

The aforementioned straightforward method and Bloom 
filter approach incur a large communication cost when $S_A$ is 
large. On the other hand, the characteristic polynomial method and 
IBLT are efficient only when prior knowledge about $d$ is available. 
Without this prior knowledge, the computation overhead of the 
characteristic polynomial method can be as large as $O(n^4)$, and 
IBLT needs to be applied repeatedly with progressively increasing 
guesses of $d$, incurring a wasted communication cost that can be 
as large as $O(n \log n)$. 

We propose an algorithm, called CS-IBLT, which is a 
novel combination of compressed sensing (CS) and IBLT, 
enabling the reconciliation problem to be solved with $O(d)$ 
communication cost even without prior knowledge about $d$. 
A distinguishing feature of CS-IBLT is that the number of 
transmitted messages adapts to the value of $d$, in contrast to 
the conventional wisdom that the correct $d$ must be estimated 
first. Notably, this adaptive feature is attributed to the use 
of CS. 



II. Proposed Method 



First, we briefly review compressed sensing (CS) and the 
invertible Bloom lookup table (IBLT) in Sec. II-A and Sec. II-B, 
respectively. Then, we describe our proposed CS-IBLT algorithm in 
Sec. II-C. We provide an analysis of CS-IBLT and a comparison 
between IBLT and CS-IBLT in Secs. II-D and II-E, respectively. 



* The evaluation point 1 requires special treatment, but we 
omit the details in this paper. 



A. Compressed Sensing 

Suppose that $x$ is an $s$-sparse vector of length $n$ with 
$s \ll n$. That is, $x$ has only $s$ nonzero components. A 
standard compressed sensing (CS) formulation is $y = \Phi x$, 
where $y \in \mathbb{R}^m$ and $\Phi \in \mathbb{R}^{m \times n}$, 
with $m \ll n$, are called the measurement vector and the 
measurement matrix, respectively. CS states that if $\Phi$ is a 
random matrix satisfying the restricted isometry property and $m$ 
is greater than $cs \log(n/s)$ for some constant $c$ [2], then $x$ 
can be reconstructed from $y$ with high probability. The vector $x$ 
can be reconstructed by $\ell_1$-minimization as follows: 

$$\arg\min_{\tilde{x} \,:\, y = \Phi \tilde{x}} \|\tilde{x}\|_1 \qquad (1)$$
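
For readers who wish to solve the $\ell_1$-minimization with a generic solver, one standard way is to rewrite it as a linear program. The auxiliary vector $t$ below is our notation and not part of the original formulation.

```latex
\begin{aligned}
\min_{\tilde{x},\, t \in \mathbb{R}^n} \quad & \sum_{j=1}^{n} t_j \\
\text{subject to} \quad & -t_j \le \tilde{x}_j \le t_j, \quad j = 1, \ldots, n, \\
& \Phi \tilde{x} = y .
\end{aligned}
```

At an optimum each $t_j$ equals $|\tilde{x}_j|$, so the objective coincides with $\|\tilde{x}\|_1$.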



B. Invertible Bloom Lookup Table 

An invertible Bloom lookup table (IBLT) is composed of 
a $b \times 2$ array, IBLT, with $k$ hash functions, $h_1(\cdot), 
\ldots, h_k(\cdot)$. It supports three operations†: INSERT, DELETE, 
and LIST-ENTRIES. Suppose that $e$ is a numeric value. To insert 
an element $e$ with the INSERT operation, $\mathrm{IBLT}[h_i(e), 1]$ 
is increased by $e$ and $\mathrm{IBLT}[h_i(e), 2]$ is increased by 1, 
for all $1 \le i \le k$. The DELETE operation removes an element $e$ 
by decreasing $\mathrm{IBLT}[h_i(e), 1]$ by $e$ and decreasing 
$\mathrm{IBLT}[h_i(e), 2]$ by 1, for all $1 \le i \le k$. The second 
column of IBLT can be treated as a counting Bloom filter [3]. 
LIST-ENTRIES is used to dump all elements currently stored in the 
IBLT. It works by searching for a position $1 \le i \le b$ where 
$\mathrm{IBLT}[i, 2] = 1$. If such an $i$ is found, the corresponding 
$\mathrm{IBLT}[i, 1]$ is listed and the operation 
DELETE($\mathrm{IBLT}[i, 1]$) is performed. This search-and-delete 
procedure is repeated until no such $i$ can be found. With this 
procedure, elements under collision can still be extracted. The 
LIST-ENTRIES operation fails if the resulting IBLT is not empty, 
and succeeds otherwise. Goodrich and Mitzenmacher show in [4] 
that to accommodate $n$ elements, the length $b$ of the IBLT needs 
to be greater than $1.2n$ when $k$ is selected to be 3.‡ This 
ensures that LIST-ENTRIES fails with negligible probability. 
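
The three operations can be sketched as follows. This is a toy version under our own assumptions, not the implementation of [4]: the hash coefficients and the modulus `PRIME` are hypothetical, elements are assumed to be positive integers smaller than `PRIME`, and each hash function addresses its own sub-table so that the $k$ positions of an element are always distinct.

```python
class IBLT:
    """Toy IBLT: a b x 2 array holding (sum of values, count) per cell."""

    PRIME = 101                              # hypothetical modulus
    COEFFS = [(3, 7), (5, 11), (7, 13)]      # hypothetical (a_i, c_i), so k <= 3

    def __init__(self, b, k):
        self.k = k
        self.width = b // k                  # size of each hash function's sub-table
        self.cells = [[0, 0] for _ in range(self.width * k)]

    def positions(self, e):
        """Cell index of e under each of the k hash functions."""
        return [
            i * self.width + ((a * e + c) % self.PRIME) % self.width
            for i, (a, c) in enumerate(self.COEFFS[: self.k])
        ]

    def insert(self, e):
        for p in self.positions(e):
            self.cells[p][0] += e
            self.cells[p][1] += 1

    def delete(self, e):
        for p in self.positions(e):
            self.cells[p][0] -= e
            self.cells[p][1] -= 1

    def list_entries(self):
        """Repeatedly list and delete the element of any cell whose count is 1."""
        listed = []
        progress = True
        while progress:
            progress = False
            for cell in self.cells:
                if cell[1] == 1:
                    e = cell[0]
                    listed.append(e)
                    self.delete(e)
                    progress = True
        success = all(cell == [0, 0] for cell in self.cells)
        return success, listed


table = IBLT(b=12, k=3)
for element in (5, 17, 42):
    table.insert(element)
success, listed = table.list_entries()
```

With these particular coefficients, 5 and 17 collide in two cells and 5 and 42 collide in one, yet all three elements are extracted because deleting one listed element exposes the next. Note that listing empties the table.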

C. CS-IBLT 

Recall that $S_A$ and $S_B$ are two sets whose sizes are bounded 
by $n$. Under CS-IBLT, host A first constructs an IBLT, 
$\mathrm{IBLT}_A$, of length $2n$ by inserting each element of $S_A$ 
into $\mathrm{IBLT}_A$. (The choice of $2n$ is discussed in 
Sec. II-D.) Host A then constructs a random measurement matrix 
$\Phi$ of dimension $m \times 2n$ satisfying the restricted isometry 
property mentioned in Sec. II-A. A calculates 
$y_A = \Phi \cdot \mathrm{IBLT}_A$; $y_A$ is thus an array of 
dimension $m \times 2$. Afterwards, A sends the rows of $y_A$ to B 
one by one until it receives a positive acknowledgement from B 
(described below). 

† As IBLT was originally designed for storing key-value pairs, it 
also supports a GET operation, which returns the value for a given 
key. Since we do not deal with key-value pairs, we omit the 
description of the GET operation for ease of presentation. 

‡ When $k$ = 4, 5, 6, and 7 are used, approximately $1.3n$, $1.4n$, 
$1.6n$, and $1.7n$ should be allocated, respectively. The rationale 
is that, for a fixed IBLT size, a larger $k$ implies more 
collisions. Although collisions are allowed in an IBLT, there cannot 
be too many of them if the elements are to be extracted. Thus, when 
a larger $k$ is used, more space is required. 

Host B constructs $\mathrm{IBLT}_B$, $\Phi$, and $y_B$ in a similar 
manner. Note that with a seed shared between A and B, the matrices 
$\Phi$ they generate can be identical row by row. Denote the 
$i$-th row of $y_A$ by $y_A^i$. Once it receives the $i$-th row 
$y_A^i$ of $y_A$, B performs CS recovery on 
$[y_A^1 - y_B^1 \;\; y_A^2 - y_B^2 \;\; \cdots \;\; y_A^i - y_B^i]^T$. 
By CS recovery on this array, we mean that $\ell_1$-minimization is 
applied to its two columns separately. Because the entries in 
$\mathrm{IBLT}_A$ and $\mathrm{IBLT}_B$ are assumed to be integers, 
quantization is applied to the recovered result. Suppose that B 
obtains a recovery result $\mathrm{IBLT}_{A-B}$ after 
$\ell_1$-minimization is applied to 
$[y_A^1 - y_B^1 \;\; y_A^2 - y_B^2 \;\; \cdots \;\; y_A^i - y_B^i]^T$. 
B then proceeds to the LIST-ENTRIES operation on 
$\mathrm{IBLT}_{A-B}$ and checks whether it succeeds. If the 
LIST-ENTRIES operation succeeds, B sends a positive acknowledgment 
meaning "stop sending more measurements" to A, and host B reconciles 
$S_B$ with $S_A$, using the $\Delta_A$ and $\Delta_B$ extracted from 
$\mathrm{IBLT}_{A-B}$. If the LIST-ENTRIES operation fails, B waits 
for the next measurement $y_A^{i+1}$ and again performs the above 
operations on $y_A^1$ through $y_A^{i+1}$. 
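
The shared-seed measurement step can be sketched in plain Python as follows. The seed value, the way a row index is mixed into the seed, and the toy arrays standing in for the first columns of the two IBLTs are all our assumptions; the $\ell_1$ recovery itself is not shown. The sketch only illustrates that both hosts can regenerate the same row of $\Phi$ and that the difference of their measurements is a measurement of the difference of their tables.

```python
import random


def measurement_row(seed, row_index, length):
    """Row `row_index` of a Gaussian matrix, reproducible from a shared seed."""
    rng = random.Random(seed * 1_000_003 + row_index)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


def measure(row, column):
    """Inner product of one measurement row with one IBLT column."""
    return sum(r * v for r, v in zip(row, column))


SEED = 2024                              # hypothetical seed agreed on by A and B
iblt_A = [0, 3, 0, 7, 5, 0, 9, 0]        # toy columns, not real hash output
iblt_B = [0, 3, 0, 7, 0, 0, 9, 4]
difference = [a - b for a, b in zip(iblt_A, iblt_B)]

residuals = []
for i in range(3):
    row_at_A = measurement_row(SEED, i, len(iblt_A))   # generated by A
    row_at_B = measurement_row(SEED, i, len(iblt_B))   # regenerated by B
    y_A_i = measure(row_at_A, iblt_A)                  # sent from A to B
    y_B_i = measure(row_at_B, iblt_B)                  # computed locally by B
    # Gap between y_A^i - y_B^i and the measurement of the difference array.
    residuals.append(abs((y_A_i - y_B_i) - measure(row_at_B, difference)))
```

Up to floating-point rounding the residuals are zero, so B can work on the sparse difference array without ever seeing A's table.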

The above setting and procedures remain the same in the 
case of $n < d \le 2n$, except that $\mathrm{IBLT}_A$ and 
$\mathrm{IBLT}_B$ of length at most $4n$ are needed instead. Note 
that $4n$ corresponds to the extreme case of $d = 2n$. 

Figure 1 illustrates how CS-IBLT works. Hosts A and 
B possess $S_A = \{1, 2, \ldots, 7\}$ and $S_B = \{2, 3, \ldots, 8\}$, 
respectively. In the following, we omit the second column of 
the IBLT in our CS-IBLT algorithm for simplicity of presentation. 
That is, we omit the counting Bloom filter part. Observe that 
$\Delta_A = \{1\}$, $\Delta_B = \{8\}$, and $d = 2$. Note that 
because $n = 7$, the IBLTs are of length 14. This corresponds to the 
requirement in Sec. II-C that IBLTs of length $2n$ be allocated. 
Suppose that $k = 2$ hash functions are used in the IBLT 
in CS-IBLT. $\mathrm{IBLT}_A$ and $\mathrm{IBLT}_B$ are derived 
according to the hash positions and then 
$\mathrm{IBLT}_A - \mathrm{IBLT}_B$ is calculated. With CS-IBLT, A 
only needs to send the first 6 entries of $y_A$ to B. That is, only 
six entries of $y_A - y_B$ are sufficient for B to exactly recover 
$\mathrm{IBLT}_A - \mathrm{IBLT}_B$. From the recovered array 
$\mathrm{IBLT}_{A-B}$, we can extract 1 and $-8$ according to the 
IBLT principles in Sec. II-B. Based on the rule described in 
Sec. II-D, B knows that $\Delta_A = \{1\}$ and $\Delta_B = \{8\}$. 

D. Analysis 

The key relationship behind our proposed CS-IBLT algorithm is 

$$y_A - y_B = \Phi(\mathrm{IBLT}_A - \mathrm{IBLT}_B). \qquad (2)$$

The CS recovery based on $y_A - y_B$ can generate an approximation 
$\mathrm{IBLT}_{A-B}$ of $\mathrm{IBLT}_A - \mathrm{IBLT}_B$. When 
the number $m$ of measurements is sufficient in the CS recovery, 
$\mathrm{IBLT}_{A-B}$ is nearly identical to 
$\mathrm{IBLT}_A - \mathrm{IBLT}_B$. Based on the principles of IBLT 
construction, $\mathrm{IBLT}_A - \mathrm{IBLT}_B$ can be thought of 
as an IBLT with the elements in $\Delta_A$ and in $\bar{\Delta}_B$, 
where $\bar{\Delta}_B$ is defined as the set 
$\{0 - e \mid e \in \Delta_B\}$. Thus, B first lists all the elements 
in $\mathrm{IBLT}_{A-B}$. The positive elements are categorized as 
$\Delta_A$ and the negative ones are categorized as $\bar{\Delta}_B$. 
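
The listing rule can be sketched on the sets of Figure 1, assuming the difference array has already been recovered exactly. The hash coefficients and the modulus are hypothetical, and, unlike the figure, the sketch keeps the count column and uses its sign ($+1$ or $-1$) to decide which difference set an extracted element belongs to. This is our simplification of the positive/negative rule above.

```python
PRIME = 101
COEFFS = [(37, 11), (59, 23)]      # hypothetical hash coefficients (k = 2)
LENGTH = 14                        # 2n with n = 7
WIDTH = LENGTH // len(COEFFS)      # each hash function has its own sub-table


def positions(e):
    return [i * WIDTH + ((a * e + c) % PRIME) % WIDTH for i, (a, c) in enumerate(COEFFS)]


def build(elements):
    """Insert every element into a fresh table of (value sum, count) cells."""
    cells = [[0, 0] for _ in range(LENGTH)]
    for e in elements:
        for p in positions(e):
            cells[p][0] += e
            cells[p][1] += 1
    return cells


def list_signed(cells):
    """Peel cells whose count is +1 (element of delta_A) or -1 (element of delta_B)."""
    delta_A, delta_B = set(), set()
    progress = True
    while progress:
        progress = False
        for cell in cells:
            sign = cell[1]
            if sign in (1, -1):
                e = cell[0] * sign               # positive element value
                (delta_A if sign == 1 else delta_B).add(e)
                for p in positions(e):           # remove the element's contribution
                    cells[p][0] -= sign * e
                    cells[p][1] -= sign
                progress = True
    return all(c == [0, 0] for c in cells), delta_A, delta_B


S_A = set(range(1, 8))             # {1, ..., 7}
S_B = set(range(2, 9))             # {2, ..., 8}
iblt_A, iblt_B = build(S_A), build(S_B)
diff = [[a[0] - b[0], a[1] - b[1]] for a, b in zip(iblt_A, iblt_B)]
success, delta_A, delta_B = list_signed(diff)
```

The shared elements 2 through 7 cancel in the subtraction, so only four cells of `diff` are nonzero and the listing returns 1 for A's difference set and 8 for B's.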
Fig. 1: An illustration of CS-IBLT. 



On the other hand, when the number $m$ of measurements 
is insufficient for the exact recovery of 
$\mathrm{IBLT}_A - \mathrm{IBLT}_B$, that is, when 
$\mathrm{IBLT}_{A-B}$ deviates significantly from 
$\mathrm{IBLT}_A - \mathrm{IBLT}_B$, B will be aware of the failed 
recovery because the LIST-ENTRIES operation applied to such an 
$\mathrm{IBLT}_{A-B}$ fails with high probability. Note that the 
reconstructed array $\mathrm{IBLT}_{A-B}$ behaves like a random one 
when an insufficient number of measurements is used, and the 
LIST-ENTRIES operation is unlikely to succeed on a random array. 
Therefore, with high probability, the decoding procedure will 
proceed until $\mathrm{IBLT}_{A-B} \approx \mathrm{IBLT}_A - \mathrm{IBLT}_B$ 
is achieved. 

The number of measurements required to recover 
$\mathrm{IBLT}_A - \mathrm{IBLT}_B$ determines the communication 
cost of CS-IBLT. Recall that we are interested in recovering 
$\mathrm{IBLT}_A - \mathrm{IBLT}_B$ from 
$y_A - y_B = \Phi(\mathrm{IBLT}_A - \mathrm{IBLT}_B)$, and the theory 
of CS states that the number of required measurements can be as 
small as $cs \log(N/s)$, where $N$ is the length of the vector to be 
recovered and $s$ is its number of nonzero entries. Observe that the 
IBLT, $\mathrm{IBLT}_A - \mathrm{IBLT}_B$, is constructed by adding 
the elements of $S_A$ and removing the elements of $S_B$. Based on 
the IBLT principles in Sec. II-B, the elements shared between A and 
B, which are the elements in 
$(S_A \cup S_B) \setminus (\Delta_A \cup \Delta_B)$, will be 
eliminated and only the elements in the set difference 
$\Delta_A \cup \Delta_B$ remain in 
$\mathrm{IBLT}_A - \mathrm{IBLT}_B$. Thus, as the vector to be 
recovered is $\mathrm{IBLT}_{A-B}$, of length $2n$ and with at most 
$kd$ nonzero entries, $\min\{2n, ckd \log\frac{2n}{kd}\}$ 
measurements are sufficient for the CS recovery, where $k$ and $d$ 
denote the number of hash functions used in the IBLT and the 
inherent size of the set differences, respectively. 

As reported in [4], the length of an IBLT with $n$ elements 
should be at least $1.2n$ to ensure the successful execution of 
the LIST-ENTRIES operation in the case of $k = 3$. However, 
the value of $1.2n$ is estimated based on an inherent assumption 
that the inserted elements are all positive. Based on the IBLT 
principles in Sec. II-B, $\mathrm{IBLT}_A - \mathrm{IBLT}_B$ can be 
regarded as an IBLT with the elements of $\Delta_A$ and 
$\bar{\Delta}_B$. Since there could be some negative elements in 
$\Delta_A$ and $\bar{\Delta}_B$, we suggest using $2n$, rather than 
$1.2n$, according to our empirical experience. 

E. Comparison 

In the case that prior knowledge about $d$ is unavailable, the 
use of IBLT incurs a large amount of wasted communication. 
In particular, a reasonable first guess is $\hat{d} = n/2$, and host 
A sends an IBLT of size $2\hat{d}$ to B. If the real $d$ is smaller 
than $\hat{d}$, B can obtain $\Delta_A$ and $\Delta_B$ successfully. 
Essentially, $2 \cdot d$ communications are sufficient for finding 
the set differences, which means that we incur an unnecessary 
communication cost that can be as large as 
$2 \cdot \frac{n}{2} - 2 \cdot 1 = n - 2$. This extreme case occurs 
when $d = 1$. 

If the real $d$ is greater than $\hat{d}$, then the LIST-ENTRIES 
operation fails, and B keeps waiting for the subsequent 
transmission from A. This time, A adopts a binary search-like 
approach and progressively increases the guess, with the next one 
being $\hat{d} = \frac{3}{4}n$. Afterwards, hosts A and B repeat the 
above procedure until B can empty $\mathrm{IBLT}_{A-B}$. In the 
extreme case of $d = n$, a communication cost of 
$2(\frac{n}{2} + \frac{3n}{4} + \cdots) = O(n \log n)$ is required. 
This performance is even worse than that of the straightforward 
method in which $S_A$ is sent to B directly. 
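
The cost of this guessing strategy can be tallied with a short model. The exact update rule (closing half of the remaining gap to $n$, rounded up) and the success test (a guess at least as large as $d$ decodes) are our reading of the binary search-like description, not a specification from the text.

```python
def iblt_guessing_cost(n, d):
    """Total number of cells sent by plain IBLT when d is unknown (illustrative model)."""
    if not 1 <= d <= n:
        raise ValueError("this model assumes 1 <= d <= n")
    guess = n // 2                      # first guess d_hat = n/2
    total = 2 * guess                   # an IBLT of size 2 * d_hat is sent
    while guess < d:                    # LIST-ENTRIES fails, so A guesses again
        guess += (n - guess + 1) // 2   # close half of the remaining gap to n
        total += 2 * guess
    return total


n = 200
cost_small_d = iblt_guessing_cost(n, 1)
wasted_small_d = cost_small_d - 2 * 1   # compared with the 2d cost when d is known
cost_large_d = iblt_guessing_cost(n, n)
```

For $n = 200$ the model gives a cost of 200 for $d = 1$, of which $n - 2 = 198$ is wasted, and a cost of 2806 for $d = n$, far above the 200 elements the straightforward method would send.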

On the other hand, in the case of $d = 1$, if CS-IBLT 
is used, since the array $\mathrm{IBLT}_A - \mathrm{IBLT}_B$ is very 
sparse (approximately only $d \cdot k = k$ nonzero entries), only a 
very small number of measurements is needed. In the case of 
$d = n$, $2n$ measurements are sufficient for the CS recovery 
in CS-IBLT. This communication cost occurs when all of the 
rows of $y_A$ are transmitted. 

III. Numerical Experiments 

In this section, we demonstrate and compare the performance 
of IBLT and CS-IBLT via numerical experiments. Figure 2 
compares the performance of both methods under the assumption 
that prior knowledge about $d$ is not available. 

In these experiments, $k = 2$ hash functions are used in both 
IBLT and CS-IBLT. In CS-IBLT, the random measurement 
matrix is Gaussian distributed. In Figure 2a, 
$|S_A| = |S_B| = n = 200$ and $d$ is varied from 1 to 200. One can 
see in Figure 2a that the communication cost of CS-IBLT increases as 
$d$ increases, because a larger $d$ implies more nonzero entries in 
$\mathrm{IBLT}_A - \mathrm{IBLT}_B$. In essence, the procedure in 
CS-IBLT here amounts to applying the CS measurement matrix to a 
$kd$-sparse array $\mathrm{IBLT}_A - \mathrm{IBLT}_B$ and then 
deriving the CS-recovered array $\mathrm{IBLT}_{A-B}$. On the other 
hand, in IBLT, because no prior knowledge about $d$ can be used, the 
guess $\hat{d} = n/2$ is used initially. This choice of $\hat{d}$ 
enables B to decode the received IBLT, resulting in a flat curve 
from $d = 1$ to $d = 100$. Similar observations can be made in 
Figure 2b. 

CS-IBLT shows its main advantage when $d$ is either relatively 
small or relatively large. In the case of small $d$, the 
overestimated $\hat{d}$ incurs unnecessary communication in IBLT, 
whereas in CS-IBLT the measurements are adaptively transmitted one 
by one and the sending stops immediately after the successful 
recovery of $\mathrm{IBLT}_A - \mathrm{IBLT}_B$. In the case of large 
$d$, the several underestimated guesses $\hat{d}$ in IBLT incur 
useless communication, whereas CS-IBLT, because of its adaptive 
property, needs at most $2n$ measurements for the successful 
recovery of $\mathrm{IBLT}_A - \mathrm{IBLT}_B$ even in the worst 
case. CS-IBLT is inferior to IBLT only in the case of moderate $d$, 
which means that the initial guess $\hat{d}$ is pretty close to the 
real $d$. The rationale behind this is that the communication cost 
of CS-IBLT is still limited by the theory of CS. That is, it is 
still dependent on $n$. However, if $d \approx \hat{d}$, we can think 
that IBLT with prior knowledge about $d$ is utilized, resulting in 
only $2d$ communication. Hence, in such cases, CS-IBLT is 
less efficient than IBLT in terms of communication cost. 

Fig. 2: The size of set differences vs. communication cost: (a) 
$n = 200$ and $k = 2$; (b) $n = 1000$ and $k = 2$. 

IV. Conclusion 

We present a novel algorithm, CS-IBLT, to address the 
reconciliation problem. According to our theoretical analysis and 
numerical experiments, CS-IBLT is superior to the previous 
methods in terms of communication cost in most cases under 
the assumption that no prior information is available. 